110 PART 3 Getting Down and Dirty with Data


» Spot-checking data entry: If you enter data from forms or printed material, choose a percentage of the entries to double-check (for example, 10 percent of the forms you entered). This can help you tell whether there are any systematic data entry errors or missing data.
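One way to choose which forms to double-check is to draw them at random, so the spot-check isn't biased toward the first or last forms entered. The following is a minimal Python sketch, assuming each form has a numeric ID; the function name `pick_forms_to_recheck`, the form numbering, and the seed value are all hypothetical and only there for illustration.

```python
import random


def pick_forms_to_recheck(form_ids, fraction=0.10, seed=2024):
    """Return a sorted random sample of form IDs to re-enter and compare.

    fraction is the share of forms to double-check (0.10 = 10 percent).
    A fixed seed makes the selection reproducible for the study records.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be greater than 0 and at most 1")
    form_ids = list(form_ids)
    if not form_ids:
        return []
    # Always check at least one form, even in a very small batch.
    n_to_check = max(1, round(len(form_ids) * fraction))
    rng = random.Random(seed)
    return sorted(rng.sample(form_ids, n_to_check))


# Hypothetical batch: forms numbered 1 through 200.
form_ids = list(range(1, 201))
to_recheck = pick_forms_to_recheck(form_ids)
print(len(to_recheck), "forms selected for double-checking")
```

If the errors you find cluster in particular variables or with a particular data entry person, that points to a systematic problem rather than random typos.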

Creating a File that Describes Your Data File

Every research database, large or small, simple or complicated, should include a data dictionary that describes the variables contained in the database. It is a necessary part of study documentation that needs to be accessible to the research team.

A data dictionary is usually set up as a table (often in Excel), where each row documents one variable in the database. For each variable, the dictionary should contain the following information (sometimes referred to as metadata, which means “data about data”):

» A variable name (usually no more than ten characters) that’s used when telling the software what variables you want it to use in an analysis

» A longer verbal description of the variable in a human-readable format (in other words, a person reading this description should be able to understand the content of the variable)

» The type of data (text, categorical, numerical, date/time, and so on)

    » If numeric: Information about how that number is displayed (how many digits are before and after the decimal point)

    » If date/time: How it’s formatted (for example, 12/25/13 10:50pm or 25Dec2013 22:50)

    » If categorical: What codes and descriptors exist for each level of the category (these are often called picklists, and can be documented on a separate tab in an Excel data dictionary)

» How missing values are represented in the database (99, 999, “NA,” and so on)
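To make the list above concrete, here is a minimal Python sketch of a data dictionary stored as one record per variable and written out as CSV, which Excel can open. The variable names (`age_yrs`, `visit_dt`, `smoke_stat`), their codes, and the column layout are hypothetical examples, not a required standard.

```python
import csv
import io

# Column layout for the dictionary (an assumed layout for illustration).
FIELDS = ["name", "description", "type", "format", "codes", "missing"]

# Hypothetical variables; one record documents one variable.
data_dictionary = [
    {"name": "age_yrs", "description": "Age at enrollment, in years",
     "type": "numerical", "format": "3 digits, 0 decimals",
     "codes": "", "missing": "999"},
    {"name": "visit_dt", "description": "Date and time of clinic visit",
     "type": "date/time", "format": "25Dec2013 22:50",
     "codes": "", "missing": "NA"},
    {"name": "smoke_stat", "description": "Self-reported smoking status",
     "type": "categorical", "format": "",
     "codes": "1=Never; 2=Former; 3=Current", "missing": "99"},
]


def check_dictionary(rows, max_name_len=10):
    """Return a list of problems found in the dictionary (empty if none)."""
    problems = []
    seen = set()
    for row in rows:
        name = row["name"]
        if len(name) > max_name_len:
            problems.append(f"{name}: name longer than {max_name_len} characters")
        if name in seen:
            problems.append(f"{name}: duplicate variable name")
        seen.add(name)
        if not row["description"]:
            problems.append(f"{name}: missing description")
        if row["type"] == "categorical" and not row["codes"]:
            problems.append(f"{name}: categorical variable without a picklist")
    return problems


def to_csv(rows):
    """Render the dictionary as CSV text that can be saved and opened in Excel."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(check_dictionary(data_dictionary))
print(to_csv(data_dictionary))
```

The small `check_dictionary` helper is optional, but a check like this can catch a categorical variable that was documented without its picklist.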

Database systems that use SQL and statistical programs like SAS often have a function that can output information like this about a data set, but that output still needs to be curated by a human. It may be helpful to start your data dictionary with such output, but it is best to complete it in Excel. That way, you can add the human curation yourself to the Excel data dictionary, and other research team members can easily access the data dictionary to better understand the variables in the database.
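The idea of starting from software output and then curating it by hand can be sketched in a few lines of Python. This is not the SAS or SQL function mentioned above; it is a stand-in that reads a small CSV file, guesses each variable's type, counts missing values, and leaves the description blank for a human to fill in. The sample data, the `MISSING_CODES` set, and the function names are assumptions for illustration.

```python
import csv
import io

# Assumption: blanks and "NA" mark missing values in this data set.
MISSING_CODES = {"", "NA"}


def is_missing(value):
    return value is None or value in MISSING_CODES


def guess_type(values):
    """Guess 'numerical' or 'text' from the non-missing values."""
    present = [v for v in values if not is_missing(v)]
    if not present:
        return "unknown"
    try:
        for v in present:
            float(v)
    except ValueError:
        return "text"
    return "numerical"


def draft_dictionary(csv_text):
    """Build a starter data dictionary: one record per column in the data."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    draft = []
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows]
        draft.append({
            "name": name,
            "description": "",  # left blank on purpose: a human writes this
            "type": guess_type(values),
            "n_missing": sum(is_missing(v) for v in values),
        })
    return draft


# Hypothetical three-row data set.
sample = "id,age_yrs,sex\n1,34,F\n2,NA,M\n3,51,F\n"
draft = draft_dictionary(sample)
for record in draft:
    print(record)
```

A draft like this can't tell that a numerically coded column is really categorical, or what a code such as 99 means, which is exactly why the human curation step is still needed.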